
annotations and polypeptides and ORFs are written to files. Gene annotation is an important
step for any new genome assembly. Since bacterial protein-coding genes generally lack introns,
predicting ORFs is easier in bacterial genomes than in eukaryotic genomes. There are many
programs for ORF prediction, but Prodigal [12] is one of the most commonly used. We have
installed Prodigal above. Prodigal can predict ORFs in any prokaryotic genomic sequence. Thus,
we can predict the ORFs in the assemblies separated by binning. In the following, we will
predict the ORFs in one of the assemblies recovered from the sample of the patient with severe
sickle cell disease.

prodigal -a prod_out/severe.faa \
    -d prod_out/severe.fnt \
    -o prod_out/severe.gbk \
    -s prod_out/genes.gff \
    -i binning/severe/severe.1.fa \
    -p single

The “-a” option specifies the FASTA file for the polypeptides (proteins) translated from the
predicted ORFs. The “-d” option specifies the FASTA file for the nucleotide sequences of the
predicted ORFs. The “-o” option specifies the main output file, in which the predicted ORFs
are written as features in a GenBank-like format by default. The “-s” option specifies a file
to which Prodigal writes all potential genes with their scores; this is a score table rather
than GFF (general feature format), which Prodigal produces in the “-o” file only when
“-f gff” is added. The “-i” option specifies the input file, which is the assembly. The “-p”
option specifies the procedure, which is either “single” for a single genome assembly or
“meta” for a metagenomic assembly that may include genomes of multiple species.
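Since binning usually yields several assemblies, the same command can be repeated for each bin. The following is only a minimal sketch, assuming that the bins are FASTA files ending in ".fa" inside one directory. The function names (count_fasta_records, predict_orfs_in_bins) and the output file names are hypothetical. The sketch adds “-f gff” so that the “-o” file is written in GFF, and “-q” to suppress the progress messages.

```shell
# Sketch: run Prodigal on every bin in a directory and report ORF counts.
# Function names and output file names are illustrative assumptions.

# Count FASTA records (header lines start with ">") in a file.
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Run Prodigal on each *.fa bin found in directory $1; write outputs to $2.
predict_orfs_in_bins() {
    local bin_dir="$1"
    local out_dir="$2"
    local bin name
    mkdir -p "$out_dir"
    for bin in "$bin_dir"/*.fa; do
        [ -e "$bin" ] || continue            # skip if the glob matched nothing
        name="$(basename "$bin" .fa)"
        prodigal -i "$bin" \
            -a "$out_dir/$name.faa" \
            -d "$out_dir/$name.fna" \
            -o "$out_dir/$name.gff" -f gff \
            -p single -q
        # Print the bin name and the number of predicted proteins.
        printf '%s\t%s\n' "$name" "$(count_fasta_records "$out_dir/$name.faa")"
    done
}

# Example call (paths follow the layout used in this chapter):
# predict_orfs_in_bins binning/severe prod_out/severe_bins
```

Each output line gives a bin name and the number of proteins Prodigal predicted for it, which is a quick check that every bin was processed.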

8.3  SUMMARY

Metagenomic DNA is isolated from environmental or clinical samples in which several microbes
are present. Unlike targeted gene sequencing, shotgun metagenomic sequencing allows
researchers to sequence the whole genomes of all organisms present in a sample and to
evaluate microbial diversity and abundance.

Shotgun metagenomic sequencing attempts to sequence the whole genomes of a large and diverse
set of microbes, each with a different genome size. Long reads produced by PacBio and Oxford
Nanopore are preferred. However, they usually have a higher error rate than short reads.
Since there are several species in a metagenomic sample, the sequencing depth must be
sufficient to allow assembling the genomes of all species in the sample.
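A rough way to judge whether the depth is sufficient is to estimate the expected coverage of one species as (number of reads × mean read length × fraction of reads from that species) / genome size. The sketch below applies this arithmetic under the simplifying assumption that reads are drawn in proportion to each species' share of the DNA. The function name estimate_depth and the example numbers are hypothetical.

```shell
# Sketch: expected mean depth for one species in a metagenomic sample.
# Arguments: number of reads, mean read length (bp), genome size (bp),
#            fraction of reads that come from the species (0 to 1).
# Assumes reads are sampled in proportion to that fraction (a simplification).
estimate_depth() {
    awk -v n="$1" -v len="$2" -v g="$3" -v frac="$4" \
        'BEGIN { printf "%.1f\n", (n * len * frac) / g }'
}

# Example: 20 million 150-bp reads, 5-Mb genome, species at 1% of reads.
# estimate_depth 20000000 150 5000000 0.01   # prints 6.0
```

In this example, a species that contributes only 1% of the reads is covered about 6 times, which shows why low-abundance species require much deeper sequencing than a single isolate would.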

Before analysis, we should make sure that we have fixed any quality problems by trimming
adaptors, filtering out low-quality reads, and removing technical sequences. In the case of
clinical samples, we should also remove the host DNA by aligning the reads to the host genome
and then writing the unaligned reads to new FASTQ files to be used in the analysis. There are
two approaches for shotgun metagenomic data analysis: the assembly-free approach and the
de novo assembly approach. The assembly-free approach does not require assembling the genomes
of the species in the sample; it uses the reads present in the metagenomic samples to assign
taxonomic groups by identifying unique genomic regions in the reads. Most of the programs used
for taxonomy assignment require a large amount of memory and storage space. The second
approach uses de novo algorithms to assemble the genomes of the